ABSTRACT

This paper presents a new approach for unsupervised spoken term detection
with spoken queries using multiple sets of acoustic patterns automatically
discovered from the target corpus. The different pattern HMM configurations
(number of states per model, number of distinct models, number of Gaussians
per state) form a three-dimensional model granularity space. Different sets
of acoustic patterns automatically discovered at different points properly
distributed over this three-dimensional space are complementary to one
another, and thus can jointly capture the characteristics of the spoken
terms. By representing the spoken content and the spoken query as sequences
of acoustic patterns, a series of approaches for matching the pattern index
sequences while considering the signal variations are developed. In this
way, not only can the on-line computation load be reduced, but the signal
variations caused by different speakers and acoustic conditions can also be
reasonably taken care of. The results indicate that this approach
significantly outperformed the unsupervised feature-based DTW baseline by
16.16% absolute in mean average precision on the TIMIT corpus.

Index Terms — zero resource speech recognition, unsupervised 
learning, dynamic time warping, hidden Markov models, spoken term 
detection 

1. INTRODUCTION 

The fast growing quantity of video and audio content over the Internet
implies a very high demand for efficient and accurate approaches to search
through spoken content. Spoken term detection (STD) usually refers to the
task of finding all occurrences of a text query term in a large spoken
archive [1]. Most STD approaches have been based on automatic speech
recognition (ASR), transforming speech into words or subwords for token
matching, with performance relying heavily on the ASR accuracy. This implies
that annotated training corpora properly matched to the spoken content are
necessary. When the input query is spoken, it becomes possible to match the
spoken content directly with the spoken query without conventional ASR. In
this way, the difficulties of conventional ASR, such as recognition errors
and the need for annotated training data, may be bypassed, which is
especially attractive for languages with very limited annotated data or
spoken content in unknown languages. This has led to recent efforts in
unsupervised STD with spoken queries without using annotated data during
training, which is also the focus of this work. Hereafter we assume that all
queries are spoken and that no annotated speech data is available.

Prevailing approaches to the task considered here rely on dynamic time
warping (DTW) to directly match the spoken query to the spoken documents.
However, a major limitation of DTW-based approaches is that the DTW
distances are easily affected by speaker mismatch and varying acoustic
conditions. Many related works have focused on feature representations and
distance measures within the DTW framework that are more robust to speaker
and acoustic condition variations, including posteriorgrams from a universal
Gaussian mixture model and acoustic segment models. The other limitation of
DTW-based approaches is that the computation load of the matching process is
linear in the number of frames to be searched through. Substantial efforts
have been devoted to reducing this computation load, such as segment-based
DTW [9], lower-bound estimation for DTW, and a locality sensitive hashing
technique for indexing speech frames.

In recent years, substantial effort has been made on unsupervised
model-based discovery of acoustic patterns from corpora without manual
annotation. In this paper, we propose a new approach for unsupervised STD
with spoken queries using multi-level acoustic patterns with varying model
granularity, automatically discovered from the target corpus. The different
pattern HMM configurations (number of states per model, number of distinct
models, number of Gaussians per state) form a three-dimensional model
granularity space. Different sets of acoustic patterns automatically
discovered at different points properly distributed over this
three-dimensional space are complementary to one another, and thus can
jointly capture the characteristics of the spoken terms. By converting the
spoken content and the query into parallel sequences of acoustic patterns
with different model granularity, token matching can be performed with
pattern indices representing highly varying signals. Very encouraging
results were obtained in the preliminary experiments.

2. ACOUSTIC PATTERNS WITH VARYING MODEL 
GRANULARITY 

2.1. Pattern Discovery for a Given Model Configuration 

Given an unlabelled speech corpus, it is not difficult to discover the
desired acoustic patterns from the corpus in an unsupervised way for a
chosen hyperparameter set $\psi$ that determines the topology of the HMMs.
This can be achieved by first finding an initial label $\omega_0$ based on a
set of assumed patterns for all observations in the corpus $x$ as in (1).
Then in each iteration $t$ the HMM parameter set $\theta_t^{\psi}$ can be
trained with the label $\omega_{t-1}$ obtained in the previous iteration as
in (2), and the new label $\omega_t$ can be obtained by free-pattern
decoding with the obtained parameter set $\theta_t^{\psi}$ as in (3).


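$$\omega_0 = \mathrm{initialization}(x), \tag{1}$$

$$\theta_t^{\psi} = \arg\max_{\theta^{\psi}} P(x \mid \theta^{\psi}, \omega_{t-1}), \tag{2}$$

$$\omega_t = \arg\max_{\omega} P(x \mid \theta_t^{\psi}, \omega). \tag{3}$$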


The training process can be repeated for enough iterations until the
difference between $\omega_{t-1}$ and $\omega_t$ becomes insignificant. This
gives a converged set of acoustic pattern HMMs, which we denote as
$\theta^{\psi}$.
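
The alternation in (1)-(3) can be illustrated with a minimal Python sketch.
The function names (`discover_patterns`, `toy_initialize`, `toy_train`,
`toy_decode`) and the toy data are hypothetical and not part of this work.
The stand-ins label scalar frames by their nearest mean instead of training
HMMs and running free-pattern decoding, so only the control flow of the
iteration is shown.

```python
def discover_patterns(x, initialize, train, decode, max_iter=50, tol=0.0):
    """Alternate training (2) and relabelling (3) until the labels stop changing."""
    labels = initialize(x)                 # Eq. (1): initial label
    model = None
    for _ in range(max_iter):
        model = train(x, labels)           # Eq. (2): fit models to the previous labels
        new_labels = decode(x, model)      # Eq. (3): relabel with the new models
        changed = sum(a != b for a, b in zip(labels, new_labels)) / len(labels)
        labels = new_labels
        if changed <= tol:                 # label difference is insignificant
            break
    return model, labels


# Toy stand-ins (assumptions for illustration only, not HMM training/decoding).
def toy_initialize(x, n=2):
    """Split the value range evenly into n assumed patterns."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n or 1.0
    return [min(int((v - lo) / width), n - 1) for v in x]


def toy_train(x, labels):
    """One mean per pattern label."""
    sums, counts = {}, {}
    for v, k in zip(x, labels):
        sums[k] = sums.get(k, 0.0) + v
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def toy_decode(x, model):
    """Assign every frame to the pattern with the nearest mean."""
    return [min(model, key=lambda k: abs(v - model[k])) for v in x]


frames = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9, 0.05]
model, labels = discover_patterns(frames, toy_initialize, toy_train, toy_decode)
```

In a real system `train` and `decode` would be replaced by HMM parameter
estimation and free-pattern decoding for the chosen configuration.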

2.2. Model Granularity Space 


The above process can be performed with many different HMM configurations,
each characterized by three hyperparameters: the number of states $m$ in
each acoustic pattern HMM, the total number of acoustic patterns $n$ during
initialization, and the number of Gaussians $l$ in each HMM state,
$\psi = (m, n, l)$. The transcription of a speech signal decoded with these
acoustic pattern HMMs may be considered as a temporal segmentation of the
signal, so the HMM length (or number of states in each HMM) $m$ represents
the temporal granularity. The set of all acoustic pattern HMMs may be
considered as a segmentation of the phonetic space, so the total number $n$
of acoustic pattern HMMs represents the phonetic granularity. The different
Gaussians in each state then jointly model the distributions of the signals
in the acoustic feature space represented by MFCCs, so the number of
Gaussians $l$ in each state represents the acoustic granularity. This gives
a three-dimensional representation of the acoustic pattern configurations in
terms of temporal, phonetic and acoustic granularities as in Fig. 1. Any
point in the three-dimensional space in Fig. 1 corresponds to an acoustic
pattern configuration.

Although the hyperparameters $\psi = (m, n, l)$ are difficult to determine
for a given corpus of unknown language and unknown linguistic
characteristics, it is possible to have many different sets of acoustic
patterns with hyperparameters and HMMs
$\{\theta^{\psi_k}, \psi_k = (m_k, n_k, l_k), k = 1, 2, \ldots, K\}$
independently learned in parallel, which can be considered as being on
different levels. Because different model granularities $(m, n, l)$ give
different characteristics to the acoustic patterns as mentioned above, with
a sufficient number of pattern sets and the model granularities $(m, n, l)$
properly distributed in the three-dimensional space, this multi-level set of
acoustic patterns may jointly represent the behavior of the signal for the
given corpus. There can be a variety of applications for these patterns, but
in this work we use them for spoken term detection, assuming the
characteristics of the spoken terms can be captured by different sets of
these patterns.

Fig. 1: The model granularity space for acoustic pattern configurations.
Axis shown: phonetic granularity ($n$), the number of acoustic pattern HMMs.


3. SPOKEN TERM DETECTION AND SEARCH METHODS

3.1. Off-line Processing

All the spoken documents in the archive are first decoded off-line into
sequences of acoustic patterns using each level of acoustic pattern HMM set
$\theta^{\psi_k}$. Let $\{p_r, r = 1, 2, 3, \ldots, n_k\}$ denote the
acoustic patterns in the set $\theta^{\psi_k}$. We further construct a
similarity matrix $S$ of size $n_k \times n_k$ for every $\theta^{\psi_k}$,
in which the component $S(i, j)$ is the similarity between any two pattern
HMMs $p_i$ and $p_j$ in the set $\theta^{\psi_k}$. The two similarity
matrices used in this work are given in (4).


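$$S(i,j) = \begin{cases} \delta(i,j), & \text{for hard similarity, or} \quad \text{(4a)} \\ \exp\left(-\mathrm{KL}(i,j)/\beta\right), & \text{for soft similarity.} \quad \text{(4b)} \end{cases}$$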

The matrix in (4a) is simply the identity matrix. The KL divergence
$\mathrm{KL}(i,j)$ between two pattern HMMs in (4b) is defined as the KL
divergence between the states, based on the variational approximation [23],
summed over the states. To transform the KL divergence into a similarity
measure between 0 and 1, a negative exponential was applied [24] with a
scaling factor $\beta$. When $\beta$ is small, the similarity between
distinct patterns in (4b) approaches zero, so (4b) becomes similar to (4a).
$\beta$ can be determined with a held-out data set, but here we simply set
it to 100 times the number of states $m_k$.
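
As a small illustration of (4a) and (4b), the following Python sketch builds
both similarity matrices from a given matrix of pairwise KL divergences. The
function names and the KL values are hypothetical; how the divergence
between two pattern HMMs is actually computed (the variational
approximation) is not shown here.

```python
import math


def hard_similarity(n):
    """Eq. (4a): identity matrix, patterns match only themselves."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def soft_similarity(kl, beta):
    """Eq. (4b): exp(-KL(i, j) / beta) for every pair of patterns."""
    n = len(kl)
    return [[math.exp(-kl[i][j] / beta) for j in range(n)] for i in range(n)]


# Made-up KL divergences between three patterns (zero on the diagonal).
kl = [
    [0.0, 120.0, 600.0],
    [150.0, 0.0, 450.0],
    [580.0, 470.0, 0.0],
]
m_k = 3                      # number of states per HMM in this pattern set
beta = 100 * m_k             # scaling factor chosen as in the text
S_soft = soft_similarity(kl, beta)

# With a very small beta the soft matrix collapses to the hard one.
S_sharp = soft_similarity(kl, 1e-3)
```

Since the matrix depends only on the pattern set, it can be computed once
off-line and reused for every query.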


3.2. On-line Matching Matrix Construction 


In the on-line phase, we perform the following for each entered spoken
query $q$ and each document $d$ in the archive. Assume that for the pattern
set $\theta^{\psi_k}$ a document $d$ is decoded into a sequence of $D$
acoustic patterns with indices $(d_1, d_2, \ldots, d_D)$ and the query $q$
into a sequence of $Q$ patterns with indices $(q_1, q_2, \ldots, q_Q)$. We
thus construct a matching matrix $W$ of size $D \times Q$ for every
document-query pair, in which each entry $(i, j)$ is the similarity between
the acoustic patterns with indices $d_i$ and $q_j$ as in (5a).


$$W(i,j) = \begin{cases} S(d_i, q_j), & \text{for 1-best sequences, or} \quad \text{(5a)} \\ P_i^{\top} S P_j, & \text{for N-best sequences.} \quad \text{(5b)} \end{cases}$$


Alternatively, we can also match the N-best sequences of documents to the
N-best sequences of queries, as depicted in Fig. 2 and (5b). We extend each
pattern position $i$ in a sequence into a posteriorgram vector $P_i$ of size
$n_k \times 1$ by accumulating the duration for each of the $n_k$ patterns
within the pattern boundaries of the best transcription, across the N-best
transcriptions considered, uniformly weighted and normalized ($P_i$ for $d$
and $P_j$ for $q$). When only the best transcription is considered, (5b)
reduces to (5a). The price paid here is the computation time, which
increases by a factor of $O(n_k^2)$ over that in (5a). One can also choose
to use the posteriorgram only on the query, which increases the load by a
factor of $O(n_k)$.
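
The two ways of filling the matching matrix can be sketched in Python as
follows. The function names, the similarity values and the index sequences
are hypothetical. Posteriorgram vectors are represented as plain lists, and
the last lines check that one-hot posteriorgrams reproduce the 1-best
matrix, as stated for (5b) and (5a).

```python
def matching_matrix_1best(S, doc, query):
    """Eq. (5a): W[i][j] = S[d_i][q_j], a table lookup per entry."""
    return [[S[d][q] for q in query] for d in doc]


def matching_matrix_nbest(S, doc_post, query_post):
    """Eq. (5b): W[i][j] = P_i^T S P_j, costing O(n_k^2) per entry."""
    n = len(S)

    def bilinear(p, r):
        return sum(p[a] * S[a][b] * r[b] for a in range(n) for b in range(n))

    return [[bilinear(p, r) for r in query_post] for p in doc_post]


def one_hot(index, n):
    """Posteriorgram of a position where only the best transcription is used."""
    return [1.0 if a == index else 0.0 for a in range(n)]


# Made-up similarity matrix for three patterns.
S = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
doc = [0, 2, 1, 1]       # pattern indices of a decoded document (D = 4)
query = [2, 1]           # pattern indices of a decoded query (Q = 2)

W_1best = matching_matrix_1best(S, doc, query)
W_nbest = matching_matrix_nbest(
    S,
    [one_hot(d, 3) for d in doc],
    [one_hot(q, 3) for q in query],
)
```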
Fig. 2: Construction of pattern posteriorgram from N-best list.


3.3. On-line Matching Policy 

There are two methods to calculate the relevance score between document $d$
and query $q$. In the sub-sequence matching (SUB) method in (6a), we sum the
elements in the matrix $W$ along the diagonal direction, generating the
accumulated similarities for all sub-sequences starting at all pattern
positions in $d$ as shown in Fig. 3(a). The maximum is selected to represent
the relevance between document $d$ and query $q$ on the pattern set
$\theta^{\psi_k}$ as in (6a).


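$$R(d,q) = \begin{cases} \max_{i:\,\text{index}} \sum_{j=1}^{Q} W(i+j, j), & \text{for SUB, or} \quad \text{(6a)} \\ \max_{u:\,\text{path}} \sum_{j} W(u_d(j), u_q(j)), & \text{for DTW.} \quad \text{(6b)} \end{cases}$$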


In order to alleviate the problem of insertions and deletions, we can also
perform dynamic time warping (DTW) on the matrix $W$ as in (6b) and
Fig. 3(b), referred to as pattern-based DTW here. This is an extended
version of (6a), except that the elements of $W$ are now accumulated along
any allowed DTW path in the matrix $W$. The summation in (6b) is over a
single DTW path, while the maximization in (6b) is over all DTW paths.
Although DTW takes longer, performing DTW on the matrix $W$ on-line is
significantly faster than conventional frame-based DTW, because most of the
calculation was performed off-line when evaluating $S$. Since a table lookup
is a constant-time operation, asymptotically speaking the on-line
computational load is reduced by a factor of $O(FT^2) = O(Fm^2)$, where $F$
is the feature dimension in frame-based DTW and $T$ is the duration of the
acoustic patterns in frames, which scales linearly with the number of states
in the HMMs.
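
The two matching policies in (6a) and (6b) can be sketched in Python as
below. The function names and the example sequences are hypothetical. The
set of allowed DTW paths is not specified in the text, so the local steps
used here, $(1,1)$, $(2,1)$ and $(1,2)$, are an assumption: each step
advances in both the document and the query and may skip one pattern on
either side. `sub_score` considers only full diagonals when the document is
at least as long as the query, which is also an assumption.

```python
NEG_INF = float("-inf")


def sub_score(W):
    """Eq. (6a): best accumulated similarity along a diagonal of W."""
    D, Q = len(W), len(W[0])
    starts = range(max(D - Q + 1, 1))
    return max(sum(W[i + j][j] for j in range(Q) if i + j < D) for i in starts)


def dtw_score(W):
    """Eq. (6b): best accumulated similarity over the assumed DTW paths.

    A path may start at any document position in query column 0 and must end
    in the last query column. Returns -inf if no such path exists.
    """
    D, Q = len(W), len(W[0])
    A = [[NEG_INF] * Q for _ in range(D)]
    for i in range(D):
        A[i][0] = W[i][0]                       # free start in the document
    for i in range(1, D):
        for j in range(1, Q):
            preds = [A[i - 1][j - 1]]           # plain diagonal step
            if i >= 2:
                preds.append(A[i - 2][j - 1])   # skip one document pattern
            if j >= 2:
                preds.append(A[i - 1][j - 2])   # skip one query pattern
            best = max(preds)
            if best > NEG_INF:
                A[i][j] = W[i][j] + best
    return max(A[i][Q - 1] for i in range(D))


# Hard-similarity example: the document contains the query patterns 1, 4, 5
# with one extra pattern (7) inserted between 4 and 5.
doc = [3, 1, 4, 7, 5, 9]
query = [1, 4, 5]
W = [[1 if d == q else 0 for q in query] for d in doc]

sub = sub_score(W)   # the diagonal breaks at the inserted pattern
dtw = dtw_score(W)   # the skip step bridges the insertion
```

In this example the diagonal sum reaches 2 while the warped path collects
all three matches, which is the insertion case DTW is meant to handle.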

Fig. 3: The matching matrix W: (a) subsequence matching on W; (b) dynamic
time warping on W.


3.4. Overall Relevance Score

In each of (4), (5) and (6) there are two options, leading to a total of 8
search methods. We thus use three binary digits to specify these methods in
reporting the experimental results below: $\gamma$ = (Soft, Nbest, DTW),
i.e. Soft=1 for soft similarity in (4b) and 0 for (4a); Nbest=1 for N-best
sequences in (5b) and 0 for (5a); DTW=1 for DTW in (6b) and 0 for (6a).
These search methods $\{\gamma_s, s = 1, \ldots, 8\}$ give different
relevance scores $R(d, q \mid \psi_k, \gamma_s)$ for each pattern set
$\theta^{\psi_k}$ as in (6). The overall relevance score $R(d, q)$ between
$d$ and $q$ is then simply the weighted sum of the scores in (6) over all
$K$ different sets of acoustic patterns $\theta^{\psi_k}$ and the 8 search
methods, with weights $\lambda(\psi_k, \gamma_s)$ as in (7). Below, we
simply set the weights $\lambda(\psi_k, \gamma_s)$ to either 0 or 1.

$$R(d, q) = \sum_{k=1}^{K} \sum_{s=1}^{8} \lambda(\psi_k, \gamma_s)\, R(d, q \mid \psi_k, \gamma_s). \tag{7}$$

Note that the acoustic pattern sets
$\{\theta^{\psi_k}, k = 1, 2, \ldots, K\}$ for different model granularities
are complementary to one another. By adding the scores obtained via
different pattern sets as in (7), the signal characteristics can be better
captured. In addition, the limitation caused by temporal granularity (model
length $m_k$) may be alleviated to some degree by the pattern-based DTW in
(6b), the limitation caused by phonetic granularity (number of patterns
$n_k$) may be alleviated to some degree by the N-best sequences in (5b), and
the limitation caused by acoustic granularity (number of Gaussians $l_k$)
may be taken care of to some degree by the soft similarity in (4b).
Therefore, the choices of $\psi_k$ and $\gamma_s$ are correlated for good
scores of $R(d, q)$ in (7). Also, the process above, including constructing
the KL-divergence matrices and decoding the spoken documents into acoustic
patterns off-line, and generating the pattern sequence for the query
on-line, can all be performed in a highly parallel manner, so the
computational load can scale with the computational resources available.


4. EXPERIMENTS

The proposed approach was tested in preliminary experiments performed on
the TIMIT corpus. 20 sets of acoustic patterns with number of states
$m$ = 3, 5, 7, 9, 11 and number of HMMs $n$ = 50, 100, 200, 300 were first
generated with $l$ = 1 Gaussian per state on the TIMIT training set. Then we
increased the number of Gaussians per state by 1 and performed (2) and (3)
until the sets converged. We repeated the process until we had
$l$ = 1, 2, 3, 4 for all the 20 pattern sets, obtaining $K$ = 80 pattern HMM
sets in the end. With the 8 search methods mentioned in Section 3, this gave
a total of 640 scores for every query-document pair in (7). The TIMIT
training set was also taken as the spoken archive from which we wish to
detect the spoken terms.

The query set consisted of 32 spoken words randomly selected from the TIMIT
testing set. An instance of every query word was randomly selected from the
testing set and used as the spoken query to search for other instances in
the training set. Note that although the choices of the weights
$\lambda(\psi_k, \gamma_s)$ can be based on optimization with respect to an
evaluation metric [25], [26], the preliminary experiments in this work were
instead intended to verify the feasibility of the proposed framework. The
baseline we compared to was frame-based DTW on MFCC sequences. The same
acoustic features were used for training the pattern HMMs. In principle, the
framework should generalize to other features as well. For spoken term
detection, the performance measures we used were the mean average precision
(MAP) and precision at 5 and 10 (P@5 and P@10). All three measures gave very
similar trends. Below we only report results for MAP due to space
limitations.


4.1. Feature Selection and Achievable Performance 

In this experiment, our goal was to learn the 640 weights
$\lambda(\psi, \gamma)$ in (7), each either 1 or 0, in order to optimize the
MAP. We randomly split the query set into 2 disjoint sets A and B, each
containing 16 queries. We used set B as the development set for learning
these weights and focus our discussion on the performance on set A. Starting
with every $\lambda(\psi, \gamma) = 0$, we greedily selected the next
$(\psi, \gamma)$ that would yield the best MAP and set its weight to 1. This
process was repeated until 20 pairs of $(\psi, \gamma)$ were selected. The
results are shown in pink in Fig. 4, in which the oracle results obtained by
learning on set A itself are also shown in cyan. The baseline of frame-based
DTW was 10.16% for set A. It was low probably because most queries
unexpectedly had fewer than 10 relevant documents in the training set, and
the various dialects in TIMIT made simply comparing the feature sequences
even more difficult. We can clearly see that with only 20 out of 640 scores
selected based on set B, the MAP reached 25.88%, significantly higher than
the baseline. The oracle results learned on set A itself reached as high as
35.60%, which implies that if a more elaborate learning algorithm were
applied, there would still be much room for improvement.
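
The greedy selection described above can be sketched in Python as follows.
All names and all scores are hypothetical. Each score set is keyed by a
$(\psi, \gamma)$ pair and holds one relevance score per query-document pair,
the combination sums the selected sets (weights of 1 for selected pairs and
0 otherwise), and MAP is computed over the ranked documents of each query.
Ties in score are broken by document name, which is an arbitrary choice made
here.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of True/False relevance flags."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(scores, relevant):
    """scores: {query: {doc: score}}, relevant: {query: set of docs}."""
    aps = []
    for q, doc_scores in scores.items():
        ranking = sorted(doc_scores, key=lambda d: (-doc_scores[d], d))
        aps.append(average_precision([d in relevant[q] for d in ranking]))
    return sum(aps) / len(aps)


def combine(score_sets, selected):
    """Sum the selected score sets (0/1 weights in the weighted sum)."""
    combined = {}
    for key in selected:
        for q, doc_scores in score_sets[key].items():
            row = combined.setdefault(q, {})
            for d, s in doc_scores.items():
                row[d] = row.get(d, 0.0) + s
    return combined


def greedy_select(score_sets, relevant, budget):
    """Repeatedly add the (psi, gamma) pair that gives the best MAP."""
    selected, history = [], []

    def map_with(key):
        return mean_average_precision(combine(score_sets, selected + [key]), relevant)

    for _ in range(min(budget, len(score_sets))):
        candidates = [k for k in score_sets if k not in selected]
        best = max(candidates, key=map_with)
        history.append(map_with(best))
        selected.append(best)
    return selected, history


# Made-up development data: keys are ((m, n, l), (Soft, Nbest, DTW)).
k1 = ((3, 100, 3), (1, 0, 0))
k2 = ((5, 50, 1), (1, 1, 0))
k3 = ((9, 300, 2), (0, 0, 1))
score_sets = {
    k1: {"q1": {"d1": 0.9, "d2": 0.5, "d3": 0.1}, "q2": {"d1": 0.6, "d2": 0.2, "d3": 0.5}},
    k2: {"q1": {"d1": 0.3, "d2": 0.6, "d3": 0.4}, "q2": {"d1": 0.1, "d2": 0.3, "d3": 0.7}},
    k3: {"q1": {"d1": 0.1, "d2": 0.9, "d3": 0.8}, "q2": {"d1": 0.9, "d2": 0.8, "d3": 0.1}},
}
relevant = {"q1": {"d1"}, "q2": {"d3"}}

selected, history = greedy_select(score_sets, relevant, budget=2)
```

In this toy example the second selected set corrects the ranking of the
query that the first one gets wrong, so the combined MAP rises.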

The parameter sets $(\psi, \gamma)$ for the 20 scores selected based on set
B are also printed in Fig. 4. Some observations can be made here. In all the
20 cases Soft=1, so Soft=1 was certainly better. For the selection of N-best
sequences or DTW there were no clear trends. However, some correlation
between search methods and pattern configurations can be observed. For
example, Nbest=1 was preferred when $l$=1 or $l$=2, probably because too few
Gaussians limited the accuracy of the 1-best sequence. Also, DTW=1 was
preferred very often when $m$=5 or 3, probably because for shorter patterns
it made better sense to merge more than one pattern into a longer pattern,
which is actually what the pattern-based DTW did. These results seem to
imply the need to learn from a development set, which is not always
available. This will be further discussed in the next section.
Fig. 4: MAP performance on query set A with top 20 scores selected from a
development set.


4.2. Performance Analysis without a Development Set 

Here we consider the case without a development set. We first consider the
different search methods $\gamma$ by summing all the 80 scores for all
combinations of $m$, $n$, $l$, but not over $\gamma$, as shown in Part (A)
of Fig. 5. Several conclusions may be drawn: (a) soft similarity brought a
massive improvement (Soft=1 > Soft=0 for all combinations of Nbest and DTW),
(b) N-best sequences brought only a negligible improvement (Nbest=1 ≈
Nbest=0 for all combinations of Soft and DTW), (c) DTW degraded the
performance in general (DTW=1 < DTW=0 for all combinations of Soft and
Nbest). Conclusion (a) is consistent with the conclusion drawn from the top
20 scores in Fig. 4. Since generating a soft similarity metric is also the
only part of the search that could be conducted off-line, it is certainly
attractive. Conclusions (b) and (c) may be a little surprising, since
intuitively N-best sequences and DTW can help, and quite a few of the top 20
scores in Fig. 4 had Nbest=1 or DTW=1. As discussed previously with Fig. 4,
Nbest and DTW could be helpful for specific pattern configurations but not
necessarily all. When such specific configurations cannot be properly chosen
with a development set, these improvements could be diluted when averaged
over all possible configurations. Because the different pattern sets carried
complementary information, jointly considering the 80 1-best sequences
obtained from the 80 pattern sets can itself be viewed as considering a
single large lattice best representing the utterance with all time warping
and N-best information included. This is probably why N-best and DTW did not
help here. The MAP obtained by summing all 640 scores without any selection
was 20.62% as shown in Part (A) of Fig. 5, which was also significantly
higher than the baseline.

Therefore, below we focus our discussion on the selection of the pattern
configurations $\psi$, assuming $\gamma$ = (1,0,0), i.e., without using
Nbest or DTW. The 80 scores for different $\psi$ form a three-dimensional
space over $(m, n, l)$. We summed the relevance scores over 2 of the
dimensions and plotted the performance on the remaining dimension. Summing
over $(m, n)$, $(m, l)$ and $(n, l)$, we get Parts (B), (C) and (D) of
Fig. 5, respectively. The averages of the MAP values for Parts (B), (C) and
(D) of Fig. 5 are also listed for comparison between the dimensions. Several
additional conclusions may be drawn: (d) the performance is better for
larger $l$, as shown in Part (B) of Fig. 5, (e) model combination on the
$(m, n)$ plane was the most effective (the average MAP of Part (B) was much
higher than those of Parts (C) and (D) of Fig. 5), (f) optimal values seem
to exist for $m$ and $n$ (the maximum occurs for $m$=3 in Part (D) and
$n$=100 in Part (C)). We strongly suspect that these optimal points for $m$
and $n$ mentioned in Conclusion (f) are inherent characteristics of the
underlying language that should stay approximately the same for a different
corpus of the same language. Conclusion (d) will probably hold until
over-fitting begins to occur, since the number of Gaussians is limited by
the training corpus size. Conclusion (e) implies that when blending scores
with respect to $(m, n)$, a good policy may be to have $m$ and $n$ as
diverse as possible if there is no information regarding the selection of
$m$ and $n$. From Conclusion (f), $m$=3 seemed close to the average phoneme
duration in TIMIT, while $n$=100 seemed close to the number of phonemes with
some context dependency considered. This may imply that the selection of
$(m, n)$ is a language-dependent characteristic, although we do not have
strong evidence yet to back up this claim. This would be useful if verified
to be true, especially for under-resourced languages, since in that case the
learned weights could be similarly useful for different corpora of the same
language.
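
The summation over two of the three dimensions can be sketched in Python as
below. The grid values are those of the experiments, while the function
name, the `score_of` callable and its constant toy score are hypothetical:
`score_of` stands for the relevance score of one fixed query-document pair
under a given configuration $(m, n, l)$ with $\gamma$ = (1,0,0).

```python
from itertools import product

M_VALUES = (3, 5, 7, 9, 11)        # states per HMM (temporal granularity)
N_VALUES = (50, 100, 200, 300)     # number of HMMs (phonetic granularity)
L_VALUES = (1, 2, 3, 4)            # Gaussians per state (acoustic granularity)

# All K = 80 configurations psi = (m, n, l); with 8 search methods, 640 scores.
configs = list(product(M_VALUES, N_VALUES, L_VALUES))


def sum_over_other_dimensions(score_of, configs, axis):
    """Sum scores over the two other dimensions, keeping `axis` (0=m, 1=n, 2=l)."""
    totals = {}
    for psi in configs:
        totals[psi[axis]] = totals.get(psi[axis], 0.0) + score_of(psi)
    return totals


def toy_score(psi):
    """Placeholder score so the sketch runs; real scores come from the search."""
    return 1.0


by_l = sum_over_other_dimensions(toy_score, configs, axis=2)  # summed over (m, n)
by_n = sum_over_other_dimensions(toy_score, configs, axis=1)  # summed over (m, l)
by_m = sum_over_other_dimensions(toy_score, configs, axis=0)  # summed over (n, l)
```

Each value of $l$ or $n$ then aggregates 20 pattern sets and each value of
$m$ aggregates 16, and MAP would be evaluated on the summed scores.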





We further plot the MAP of the relevance scores $R(d, q)$ for
$\gamma$ = (1,0,0) and $l$ = 3 for sets A and B in Fig. 6, using the best
performing $\gamma$ and $l$ in Parts (A) and (B) of Fig. 5. The 20 points
for $(m, n)$ were interpolated with a 2D spline function to show the
smoothed MAP distributions over the plane. As can be seen, the score
distributions look similar to a good degree even for completely different
query sets in Fig. 6(a) and (b). When we simply selected the 20 scores for
set A here ($m$ = 3, 5, 7, 9, 11; $n$ = 50, 100, 200, 300; $l$ = 3;
$\gamma$ = (1,0,0)), the MAP was 26.32%, slightly higher than the 25.88%
achieved above by the 20 scores selected greedily from a development set.



Fig. 6: MAP for $\gamma$ = (1,0,0) and $l$ = 3 over the $(m, n)$ plane for
query sets A and B.


5. CONCLUSION 

In this work, we present a new approach for unsupervised spoken term
detection using multi-level acoustic patterns discovered from the target
corpus. The different pattern sets with different model configurations are
complementary, and thus can jointly capture the information for the spoken
terms. Significantly better performance than frame-based DTW on the TIMIT
corpus was obtained.